15 research outputs found

Pareto Principles in Infinite Ethics

Author: Askell, Amanda
Publication date: 01/01/2018

It is possible that the world contains infinitely many agents that have positive and negative levels of well-being. Theories have been developed to ethically rank such worlds based on the well-being levels of the agents in those worlds or other qualitative properties of the worlds in question, such as the distribution of agents across spacetime. In this thesis I argue that such ethical rankings ought to be consistent with the Pareto principle, which says that if two worlds contain the same agents and some agents are better off in the first world than they are in the second and no agents are worse off than they are in the second, then the first world is better than the second. I show that if we accept four axioms – the Pareto principle, transitivity, an axiom stating that populations of worlds can be permuted, and the claim that if the ‘at least as good as’ relation holds between two worlds then it holds between qualitative duplicates of this world pair – then we must conclude that there is ubiquitous incomparability between infinite worlds. I show that this is true even if the populations of infinite worlds are disjoint or overlapping, and that we cannot use any qualitative properties of world pairs to rank these worlds. Finally, I argue that this incomparability result generates puzzles for both consequentialist and non-consequentialist theories of objective and subjective permissibility.

PhilPapers
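
The Pareto principle stated in the abstract above can be written compactly. The following is only a sketch under assumed notation, none of which comes from the thesis itself: $A$ is the shared population of the two worlds, $u_w(a)$ is the well-being of agent $a$ in world $w$, and $\succ$ is the strict "better than" relation.

```latex
% Assumed notation (not from the source): A is the common population,
% u_w(a) is agent a's well-being in world w, \succ means "is better than".
\[
\Bigl(\forall a \in A:\; u_{w_1}(a) \ge u_{w_2}(a)\Bigr)
\;\wedge\;
\Bigl(\exists a \in A:\; u_{w_1}(a) > u_{w_2}(a)\Bigr)
\;\Longrightarrow\;
w_1 \succ w_2
\]
```

The first conjunct says that no agent is worse off in $w_1$, and the second says that at least one agent is strictly better off.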

Towards Understanding Sycophancy in Language Models

Authors: Askell, Amanda; Bowman, Samuel R.; Cheng, Newton; Durmus, Esin; Duvenaud, David; Hatfield-Dodds, Zac; Johnston, Scott R.; Korbak, Tomasz; Kravec, Shauna; Maxwell, Timothy; McCandlish, Sam; Ndousse, Kamal; Perez, Ethan; Rausch, Oliver; Schiefer, Nicholas; Sharma, Mrinank; Tong, Meg; Yan, Da; Zhang, Miranda
Publication date: 27/10/2023

Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.

Comment: 32 pages, 20 figures

arXiv.org e-Print Archive

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming across 3 model sizes (2.7B, 13B, and 52B parameters) and 4 model types: a plain language model (LM); an LM prompted to be helpful, honest, and harmless; an LM with rejection sampling; and a model trained to be helpful and harmless using reinforcement learning from human feedback (RLHF). We find that the RLHF models are increasingly difficult to red team as they scale, and we find a flat trend with scale for the other model types. Second, we release our dataset of 38,961 red team attacks for others to analyze and learn from. We provide our own analysis of the data and find a variety of harmful outputs, which range from offensive language to more subtly harmful non-violent unethical outputs. Third, we exhaustively describe our instructions, processes, statistical methodologies, and uncertainty about red teaming. We hope that this transparency accelerates our ability to work together as a community in order to develop shared norms, practices, and technical standards for how to red team language models.

arXiv.org e-Print Archive

Language Models (Mostly) Know What They Know

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability "P(True)" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict "P(IK)", the probability that "I know" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.

Comment: 23+17 pages; refs added, typos fixed

arXiv.org e-Print Archive
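
The notion of calibration used in the abstract above can be illustrated with a small sketch. It assumes we already have a list of predicted probabilities (for example, P(True) values) and whether each corresponding answer was actually correct. The function name `expected_calibration_error`, the equal-width binning, and the choice of this particular metric are assumptions made for illustration; they are not presented as the paper's own evaluation procedure.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    probs: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Hypothetical helper: equal-width-bin expected calibration error.

    probs[i] is the predicted probability that answer i is correct,
    correct[i] is whether it actually was.
    """
    if len(probs) != len(correct):
        raise ValueError("probs and correct must have the same length")
    if len(probs) == 0:
        raise ValueError("at least one prediction is required")
    if any(p < 0.0 or p > 1.0 for p in probs):
        raise ValueError("probabilities must lie in [0, 1]")

    # Group predictions into n_bins equal-width confidence bins.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p, c in zip(probs, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, c))

    # Weighted average of |mean confidence - accuracy| over non-empty bins.
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, c in members if c) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A value near 0 means that, within each confidence bin, the stated probabilities roughly match the observed accuracy; larger values indicate over- or under-confidence.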

Specific versus General Principles for Constitutional AI

Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as "do what's best for humanity". We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.

arXiv.org e-Print Archive

The moral inefficacy of carbon offsetting

Authors: Askell, Amanda; John, Tyler M.; Wilkinson, Hayden

Many real-world agents recognise that they impose harms by choosing to emit carbon, e.g., by flying. Yet many do so anyway, and then attempt to make things right by offsetting those harms. Such offsetters typically believe that, by offsetting, they change the deontic status of their behaviour, making an otherwise impermissible action permissible. Do they succeed in practice? Some philosophers have argued that they do, since their offsets appear to reverse the adverse effects of their emissions. But we show that they do not. In practice, standard carbon offsetting does not reverse the harms of the original action, nor does it even benefit the same group as was harmed. Standard moral theories hence deny that such offsetting succeeds. Indeed, we show that any moral theory that allows offsetting in this setting faces a dilemma between allowing any wrong to be offset, no matter how grievous, and recognising an implausibly sharp discontinuity between offsettable actions and non-offsettable actions. The most plausible response is to accept that carbon offsetting fails to right our climate wrongs.

PhilPapers